temporal difference
The surprising efficiency of temporal difference learning for rare event prediction
We quantify the efficiency of temporal difference (TD) learning over the direct, or Monte Carlo (MC), estimator for policy evaluation in reinforcement learning, with an emphasis on estimation of quantities related to rare events. Policy evaluation is complicated in the rare event setting by the long timescale of the event and by the need for \emph{relative accuracy} in estimates of very small values. Specifically, we focus on least-squares TD (LSTD) prediction for finite state Markov chains, and show that LSTD can achieve relative accuracy far more efficiently than MC. We prove a central limit theorem for the LSTD estimator and upper bound the \emph{relative asymptotic variance} by simple quantities characterizing the connectivity of states relative to the transition probabilities between them. Using this bound, we show that, even when both the timescale of the rare event and the relative accuracy of the MC estimator are exponentially large in the number of states, LSTD maintains a fixed level of relative accuracy with a total number of observed transitions of the Markov chain that is only \emph{polynomially} large in the number of states.
Statistical Efficiency of Distributional Temporal Difference Learning
Distributional reinforcement learning (DRL) has achieved empirical success in various domains.One of the core tasks in the field of DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$.The distributional temporal difference learning has been accordingly proposed, whichis an extension of the temporal difference learning (TD) in the classic RL area.In the tabular case, Rowland et al. [2018] and Rowland et al. [2023] proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference learning (CTD) and quantile temporal difference learning (QTD), respectively.In this paper, we go a step further and analyze the finite-sample performance of distributional TD.To facilitate theoretical analysis, we propose a non-parametric distributional TD learning (NTD).For a $\gamma$-discounted infinite-horizon tabular Markov decision process,we show that for NTD we need $\widetilde O\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance.This sample complexity bound is minimax optimal (up to logarithmic factors) in the case of the $1$-Wasserstein distance.To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest.In addition, we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $p$-Wasserstein distance.
RL without TD learning
In this post, I'll introduce a reinforcement learning (RL) algorithm based on an "alternative" paradigm: divide and conquer We can do Reinforcement Learning (RL) based on divide and conquer, instead of temporal difference (TD) learning. There are two classes of algorithms in RL: on-policy RL and off-policy RL. On-policy RL means we can use fresh data collected by the current policy. In other words, we have to throw away old data each time we update the policy. Algorithms like PPO and GRPO (and policy gradient methods in general) belong to this category.
- Europe > Netherlands > South Holland > Leiden (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
TAMI: Taming Heterogeneity in Temporal Interactions for Temporal Graph Link Prediction
Yu, Zhongyi, Wu, Jianqiu, Wu, Zhenghao, Zhong, Shuhan, Su, Weifeng, Lee, Chul-Ho, Zhuo, Weipeng
Temporal graph link prediction aims to predict future interactions between nodes in a graph based on their historical interactions, which are encoded in node embeddings. We observe that heterogeneity naturally appears in temporal interactions, e.g., a few node pairs can make most interaction events, and interaction events happen at varying intervals. This leads to the problems of ineffective temporal information encoding and forgetting of past interactions for a pair of nodes that interact intermittently for their link prediction. Existing methods, however, do not consider such heterogeneity in their learning process, and thus their learned temporal node embeddings are less effective, especially when predicting the links for infrequently interacting node pairs. To cope with the heterogeneity, we propose a novel framework called TAMI, which contains two effective components, namely log time encoding function (LTE) and link history aggregation (LHA). LTE better encodes the temporal information through transforming interaction intervals into more balanced ones, and LHA prevents the historical interactions for each target node pair from being forgotten. State-of-the-art temporal graph neural networks can be seamlessly and readily integrated into TAMI to improve their effectiveness. Experiment results on 13 classic datasets and three newest temporal graph benchmark (TGB) datasets show that TAMI consistently improves the link prediction performance of the underlying models in both transductive and inductive settings. Our code is available at https://github.com/Alleinx/TAMI_temporal_graph.
- Asia > China > Liaoning Province > Shenyang (0.04)
- Asia > China > Hong Kong (0.04)
- North America > United States > Texas (0.04)
- (4 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Government (0.92)
- Information Technology (0.67)
- Education > Educational Setting > Online (0.46)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (4 more...)
The surprising efficiency of temporal difference learning for rare event prediction
We quantify the efficiency of temporal difference (TD) learning over the direct, or Monte Carlo (MC), estimator for policy evaluation in reinforcement learning, with an emphasis on estimation of quantities related to rare events. Policy evaluation is complicated in the rare event setting by the long timescale of the event and by the need for \emph{relative accuracy} in estimates of very small values. Specifically, we focus on least-squares TD (LSTD) prediction for finite state Markov chains, and show that LSTD can achieve relative accuracy far more efficiently than MC. We prove a central limit theorem for the LSTD estimator and upper bound the \emph{relative asymptotic variance} by simple quantities characterizing the connectivity of states relative to the transition probabilities between them. Using this bound, we show that, even when both the timescale of the rare event and the relative accuracy of the MC estimator are exponentially large in the number of states, LSTD maintains a fixed level of relative accuracy with a total number of observed transitions of the Markov chain that is only \emph{polynomially} large in the number of states.
Developing the Foundations of Reinforcement Learning
The examples are nothing if not relatable: preparing breakfast, or playing a game of chess or tic-tac-toe. Yet the idea of learning from the environment and taking steps that progress toward a goal apparently was under-studied when ACM A.M. Turing Award recipients Andrew G. Barto and Richard S. Sutton took on the topic in the late 1970s. Eventually, their research led to the creation of reinforcement learning algorithms that sought not to recognize patterns but maximize rewards. Barto and Sutton spoke about how it all unfolded, and what's next for the techniques that are so celebrated for their success in AlphaGo and AlphaZero. Let's start with the earliest days of your collaboration.
- North America > Canada > Alberta (0.05)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.05)
Distributed Value Decomposition Networks with Networked Agents
Varela, Guilherme S., Sardinha, Alberto, Melo, Francisco S.
We investigate the problem of distributed training under partial observability, whereby cooperative multi-agent reinforcement learning agents (MARL) maximize the expected cumulative joint reward. We propose distributed value decomposition networks (DVDN) that generate a joint Q-function that factorizes into agent-wise Q-functions. Whereas the original value decomposition networks rely on centralized training, our approach is suitable for domains where centralized training is not possible and agents must learn by interacting with the physical environment in a decentralized manner while communicating with their peers. DVDN overcomes the need for centralized training by locally estimating the shared objective. We contribute with two innovative algorithms, DVDN and DVDN (GT), for the heterogeneous and homogeneous agents settings respectively. Empirically, both algorithms approximate the performance of value decomposition networks, in spite of the information loss during communication, as demonstrated in ten MARL tasks in three standard environments.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- (10 more...)